matthews_corrcoef (Matthews Correlation Coefficient, MCC)#

The Matthews correlation coefficient (MCC) is a single-number summary of a classifier’s confusion matrix. It can be interpreted as the Pearson correlation between true and predicted labels, so it naturally lives in \([-1, 1]\):

  • \(+1\): perfect predictions

  • \(0\): no better than random (no correlation)

  • \(-1\): perfectly wrong (systematic inversion)

MCC is especially useful when classes are imbalanced, because it uses all four confusion-matrix entries (TP, TN, FP, FN).
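The three endpoints above can be checked in a couple of lines with scikit-learn (tiny hand-picked labels, chosen for illustration):

```python
import numpy as np
from sklearn.metrics import matthews_corrcoef

y = np.array([1, 1, 0, 0])

print(matthews_corrcoef(y, y))      # perfect agreement -> 1.0
print(matthews_corrcoef(y, 1 - y))  # systematic inversion -> -1.0
# A prediction with TP = TN = FP = FN = 1, so TP*TN - FP*FN = 0:
print(matthews_corrcoef(y, np.array([1, 0, 1, 0])))  # -> 0.0
```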


Learning goals#

  • derive MCC from the confusion matrix and from Pearson correlation

  • implement MCC from scratch in NumPy (binary + multiclass)

  • build intuition with Plotly visuals (imbalance + thresholding)

  • use MCC to select a decision threshold / tune a simple model

Table of contents#

  1. Confusion matrix recap

  2. Binary MCC: definition + correlation view

  3. Multiclass MCC

  4. NumPy implementation (from scratch)

  5. Intuition plots (TPR/TNR surface + imbalance trap)

  6. Using MCC for optimization: threshold tuning for logistic regression

  7. Pros, cons, and when to use MCC

  8. Exercises + references

import numpy as np

import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio
from plotly.subplots import make_subplots

from sklearn.datasets import make_classification
from sklearn.metrics import matthews_corrcoef as sk_matthews_corrcoef
from sklearn.model_selection import train_test_split

pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")

np.set_printoptions(precision=4, suppress=True)
rng = np.random.default_rng(7)

1) Confusion matrix recap#

For binary classification, assume the positive class is labeled \(1\) and the negative class is labeled \(0\).

A confusion matrix counts outcomes:

|              | predicted \(1\) | predicted \(0\) |
|--------------|-----------------|-----------------|
| true \(1\)   | TP              | FN              |
| true \(0\)   | FP              | TN              |

With total sample size:

\[ N = \text{TP} + \text{TN} + \text{FP} + \text{FN}. \]

Useful rates:

  • TPR / recall / sensitivity: \(\text{TPR} = \frac{\text{TP}}{\text{TP}+\text{FN}}\)

  • TNR / specificity: \(\text{TNR} = \frac{\text{TN}}{\text{TN}+\text{FP}}\)

MCC “wants” both TPR and TNR to be high.

2) Binary MCC: definition + correlation view#

2.1 Definition (confusion-matrix form)#

The (binary) Matthews correlation coefficient is

\[ \mathrm{MCC} = \frac{\text{TP}\,\text{TN} - \text{FP}\,\text{FN}} {\sqrt{(\text{TP}+\text{FP})(\text{TP}+\text{FN})(\text{TN}+\text{FP})(\text{TN}+\text{FN})}}. \]
  • The numerator rewards agreement (TP·TN) and penalizes disagreement (FP·FN).

  • The denominator normalizes to keep the score in \([-1, 1]\).

If the denominator is \(0\) (e.g. constant predictions, or all labels are the same), MCC is mathematically undefined. In practice (and in scikit-learn), it is returned as 0.0.
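As a small worked example (counts invented for illustration): with TP = 6, TN = 3, FP = 1, FN = 2, the numerator is \(6\cdot 3 - 1\cdot 2 = 16\) and the denominator is \(\sqrt{7\cdot 8\cdot 4\cdot 5}\approx 33.47\), giving MCC ≈ 0.478:

```python
import numpy as np

tp, tn, fp, fn = 6, 3, 1, 2
num = tp * tn - fp * fn                                         # 18 - 2 = 16
denom = np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))  # sqrt(7 * 8 * 4 * 5)
print(num / denom)  # -> ~0.478
```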

2.2 MCC = Pearson correlation for 0/1 labels#

Let \(Y, \hat Y \in \{0,1\}\) be the true and predicted labels. The Pearson correlation is

\[ \rho(Y, \hat Y) = \frac{\mathrm{Cov}(Y, \hat Y)}{\sqrt{\mathrm{Var}(Y)\,\mathrm{Var}(\hat Y)}}. \]

Using the contingency table above:

  • \(\mathbb{E}[Y] = \frac{\text{TP}+\text{FN}}{N}\)

  • \(\mathbb{E}[\hat Y] = \frac{\text{TP}+\text{FP}}{N}\)

  • \(\mathbb{E}[Y\hat Y] = \frac{\text{TP}}{N}\)

So

\[ \mathrm{Cov}(Y,\hat Y) = \mathbb{E}[Y\hat Y] - \mathbb{E}[Y]\,\mathbb{E}[\hat Y] = \frac{\text{TP}\,\text{TN} - \text{FP}\,\text{FN}}{N^2}. \]

And

\[ \mathrm{Var}(Y) = \frac{(\text{TP}+\text{FN})(\text{TN}+\text{FP})}{N^2}, \quad \mathrm{Var}(\hat Y) = \frac{(\text{TP}+\text{FP})(\text{TN}+\text{FN})}{N^2}. \]

Plugging these into \(\rho\) yields the MCC formula. This is why MCC is also known as the phi coefficient (correlation for two binary variables).
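This identity is easy to verify numerically: for 0/1 label vectors, `np.corrcoef` and `matthews_corrcoef` should agree to floating-point precision (the random labels below are just for the check):

```python
import numpy as np
from sklearn.metrics import matthews_corrcoef

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
y_hat = np.where(rng.random(200) < 0.3, 1 - y, y)  # flip ~30% of the labels

rho = np.corrcoef(y, y_hat)[0, 1]  # Pearson correlation of the 0/1 vectors
print(rho, matthews_corrcoef(y, y_hat))
```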

3) Multiclass MCC#

MCC has a natural multiclass extension based on the full \(K\times K\) confusion matrix.

Let \(C\in\mathbb{N}^{K\times K}\) with entries:

\[ C_{ij} = \#\{n : y^{(n)}=i, \; \hat y^{(n)}=j\}. \]

Define:

  • \(s = \sum_{i,j} C_{ij}\) (total)

  • \(c = \sum_k C_{kk}\) (correct / trace)

  • \(t_k = \sum_j C_{k j}\) (true count per class; row sums)

  • \(p_k = \sum_i C_{i k}\) (predicted count per class; column sums)

Then the multiclass MCC is:

\[ \mathrm{MCC} = \frac{c\,s - \sum_k t_k p_k} {\sqrt{\left(s^2 - \sum_k p_k^2\right)\left(s^2 - \sum_k t_k^2\right)}}. \]

It reduces to the binary formula when \(K=2\), and can be viewed as a correlation between one-hot encodings of \(y\) and \(\hat y\).
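A minimal check of the \(K=2\) reduction (small hand-made labels; the quantities \(s\), \(c\), \(t_k\), \(p_k\) are computed directly from the definitions above):

```python
import numpy as np
from sklearn.metrics import matthews_corrcoef

y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0, 0, 1, 0])

# K x K (here 2 x 2) confusion matrix, rows = true, cols = predicted
C = np.zeros((2, 2), dtype=float)
np.add.at(C, (y_true, y_pred), 1)

s = C.sum()          # total
c = np.trace(C)      # correct
t = C.sum(axis=1)    # row sums: true counts per class
p = C.sum(axis=0)    # column sums: predicted counts per class

mcc_k = (c * s - t @ p) / np.sqrt((s**2 - p @ p) * (s**2 - t @ t))
print(mcc_k, matthews_corrcoef(y_true, y_pred))  # both 0.5 here
```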

4) NumPy implementation (from scratch)#

We’ll implement:

  • a simple confusion matrix builder

  • MCC for binary and multiclass using the \(K\times K\) formula

  • (optionally) the binary closed-form as a sanity check

def confusion_matrix_np(y_true, y_pred, labels=None):
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)

    if y_true.shape != y_pred.shape:
        raise ValueError("y_true and y_pred must have the same shape")

    if labels is None:
        labels = np.unique(np.concatenate([y_true, y_pred]))
    else:
        labels = np.asarray(labels)

    label_to_index = {label: i for i, label in enumerate(labels.tolist())}

    true_idx = np.fromiter((label_to_index.get(v, -1) for v in y_true), dtype=int, count=y_true.size)
    pred_idx = np.fromiter((label_to_index.get(v, -1) for v in y_pred), dtype=int, count=y_pred.size)

    if (true_idx < 0).any() or (pred_idx < 0).any():
        raise ValueError("labels must contain all values appearing in y_true and y_pred")

    k = labels.size
    cm = np.zeros((k, k), dtype=int)
    np.add.at(cm, (true_idx, pred_idx), 1)
    return cm, labels


def mcc_from_counts(tp, tn, fp, fn):
    tp = np.asarray(tp, dtype=float)
    tn = np.asarray(tn, dtype=float)
    fp = np.asarray(fp, dtype=float)
    fn = np.asarray(fn, dtype=float)

    num = tp * tn - fp * fn
    denom = np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

    # Divide only where the denominator is nonzero to avoid 0/0 warnings;
    # the undefined cases are returned as 0.0 by convention.
    out = np.zeros_like(num, dtype=float)
    np.divide(num, denom, out=out, where=denom != 0)
    return out


def matthews_corrcoef_np(y_true, y_pred, labels=None) -> float:
    cm, _ = confusion_matrix_np(y_true, y_pred, labels=labels)

    t_sum = cm.sum(axis=1, dtype=float)  # true per class
    p_sum = cm.sum(axis=0, dtype=float)  # predicted per class

    s = float(cm.sum())
    c = float(np.trace(cm))

    num = c * s - float(np.dot(t_sum, p_sum))
    denom = np.sqrt((s**2 - float(np.dot(p_sum, p_sum))) * (s**2 - float(np.dot(t_sum, t_sum))))

    return 0.0 if denom == 0.0 else num / denom


def confusion_counts_binary(y_true, y_pred, positive_label=1):
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)

    true_pos = y_true == positive_label
    pred_pos = y_pred == positive_label

    tp = int(np.sum(true_pos & pred_pos))
    tn = int(np.sum(~true_pos & ~pred_pos))
    fp = int(np.sum(~true_pos & pred_pos))
    fn = int(np.sum(true_pos & ~pred_pos))
    return tp, tn, fp, fn
# Quick sanity check vs scikit-learn

y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 0, 0])

cm, labels = confusion_matrix_np(y_true, y_pred)
tp, tn, fp, fn = confusion_counts_binary(y_true, y_pred, positive_label=1)

print("labels:", labels)
print("confusion matrix (rows=true, cols=pred):\n", cm)
print("TP,TN,FP,FN:", tp, tn, fp, fn)

print("MCC (scratch, KxK):", matthews_corrcoef_np(y_true, y_pred))
print("MCC (scratch, binary counts):", float(mcc_from_counts(tp, tn, fp, fn)))
print("MCC (sklearn):", sk_matthews_corrcoef(y_true, y_pred))
labels: [0 1]
confusion matrix (rows=true, cols=pred):
 [[4 1]
 [1 2]]
TP,TN,FP,FN: 2 4 1 1
MCC (scratch, KxK): 0.4666666666666667
MCC (scratch, binary counts): 0.4666666666666667
MCC (sklearn): 0.4666666666666667

4.1 Multiclass sanity check#

MCC supports multiclass via the confusion-matrix generalization. We’ll verify our NumPy implementation against scikit-learn on a simple 3-class example.

# Multiclass sanity check (K=3)

y_true_mc = rng.integers(0, 3, size=500)

y_pred_mc = y_true_mc.copy()
noise = rng.random(size=y_true_mc.size) < 0.25
# if noisy: replace with a random label in {0,1,2}
y_pred_mc[noise] = rng.integers(0, 3, size=int(noise.sum()))

mcc_mc = matthews_corrcoef_np(y_true_mc, y_pred_mc)

cm_mc, labels_mc = confusion_matrix_np(y_true_mc, y_pred_mc)

fig = px.imshow(
    cm_mc,
    x=[f"pred {l}" for l in labels_mc],
    y=[f"true {l}" for l in labels_mc],
    text_auto=True,
    color_continuous_scale="Blues",
)
fig.update_layout(title=f"Multiclass confusion matrix (MCC={mcc_mc:.3f})")
fig.show()

print("MCC multiclass (scratch):", mcc_mc)
print("MCC multiclass (sklearn):", sk_matthews_corrcoef(y_true_mc, y_pred_mc))
MCC multiclass (scratch): 0.759901798558076
MCC multiclass (sklearn): 0.759901798558076

5) Intuition plots#

5.1 MCC as a function of TPR and TNR#

If we fix the class prevalence \(\pi = P(Y=1)\) and imagine a classifier with some \((\text{TPR},\text{TNR})\), the expected confusion counts (for large \(N\)) are:

\[ \text{TP} = N\,\pi\,\text{TPR},\quad \text{FN} = N\,\pi\,(1-\text{TPR}),\quad \text{TN} = N\,(1-\pi)\,\text{TNR},\quad \text{FP} = N\,(1-\pi)\,(1-\text{TNR}). \]

Plotting MCC over \((\text{TPR},\text{TNR})\) shows how both kinds of mistakes affect the score.

def plot_mcc_surface(pi: float, grid_steps: int = 101, title: str | None = None):
    t = np.linspace(0.0, 1.0, grid_steps)
    tpr, tnr = np.meshgrid(t, t, indexing="xy")

    n = 1.0  # scale cancels out in MCC
    tp = n * pi * tpr
    fn = n * pi * (1 - tpr)
    tn = n * (1 - pi) * tnr
    fp = n * (1 - pi) * (1 - tnr)

    z = mcc_from_counts(tp, tn, fp, fn)

    fig = px.imshow(
        z,
        x=t,
        y=t,
        origin="lower",
        aspect="auto",
        zmin=-1,
        zmax=1,
        color_continuous_scale="RdBu",
        labels={"x": "TPR (recall)", "y": "TNR (specificity)", "color": "MCC"},
    )

    fig.add_trace(
        go.Scatter(x=t, y=t, mode="lines", name="TPR = TNR", line=dict(color="black", dash="dash"))
    )

    fig.update_layout(
        title=title or f"MCC surface over (TPR, TNR) with prevalence π={pi:.2f}",
        coloraxis_colorbar=dict(title="MCC"),
    )

    return fig


fig = plot_mcc_surface(pi=0.10)
fig.show()

5.2 The “accuracy trap” under imbalance#

Consider a dataset where the positive class is rare. A trivial classifier that always predicts the negative class achieves very high accuracy, even though it is useless.

MCC exposes this: constant predictions lead to an MCC of 0.
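A direct check of the trap (synthetic labels with a rare positive class; the 2% prevalence is illustrative):

```python
import numpy as np
from sklearn.metrics import matthews_corrcoef

rng = np.random.default_rng(0)
y = (rng.random(1000) < 0.02).astype(int)  # ~2% positives
y_always_neg = np.zeros_like(y)

print("accuracy:", (y == y_always_neg).mean())     # high, despite a useless model
print("MCC:", matthews_corrcoef(y, y_always_neg))  # 0.0 by convention
```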

prevalence = np.linspace(0.001, 0.999, 300)

acc_always_negative = 1.0 - prevalence
mcc_always_negative = np.zeros_like(prevalence)
balanced_acc_always_negative = np.full_like(prevalence, 0.5)

fig = go.Figure()
fig.add_trace(go.Scatter(x=prevalence, y=acc_always_negative, name="accuracy (always predict 0)"))
fig.add_trace(go.Scatter(x=prevalence, y=balanced_acc_always_negative, name="balanced accuracy (always 0)", line=dict(dash="dash")))
fig.add_trace(go.Scatter(x=prevalence, y=mcc_always_negative, name="MCC (always 0)", line=dict(dash="dot")))

fig.update_layout(
    title="Imbalance demo: accuracy can look great while MCC stays 0",
    xaxis_title="Positive prevalence π = P(Y=1)",
    yaxis_title="Metric value",
    yaxis=dict(range=[-0.05, 1.05]),
)
fig.show()

6) Using MCC for optimization: threshold tuning for logistic regression#

MCC is not differentiable with respect to model parameters (it depends on discrete labels), so we typically:

  1. train a probabilistic model with a smooth loss (e.g. log-loss)

  2. choose a decision threshold (or hyperparameters) that maximizes MCC on a validation set

Below is a minimal from-scratch logistic regression and an MCC-based threshold selection.

def add_intercept(X: np.ndarray) -> np.ndarray:
    X = np.asarray(X, dtype=float)
    return np.c_[np.ones((X.shape[0], 1)), X]


def sigmoid(z):
    z = np.asarray(z, dtype=float)

    out = np.empty_like(z, dtype=float)
    pos = z >= 0

    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    ez = np.exp(z[~pos])
    out[~pos] = ez / (1.0 + ez)

    return out


def binary_log_loss(y_true, p, eps: float = 1e-15) -> float:
    y_true = np.asarray(y_true, dtype=float)
    p = np.asarray(p, dtype=float)

    p = np.clip(p, eps, 1.0 - eps)
    return float(-np.mean(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p)))


def standardize_fit(X_train: np.ndarray):
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)
    sigma = np.where(sigma == 0, 1.0, sigma)
    return mu, sigma


def standardize_apply(X: np.ndarray, mu: np.ndarray, sigma: np.ndarray) -> np.ndarray:
    return (X - mu) / sigma


def fit_logistic_regression_gd(
    X: np.ndarray,
    y: np.ndarray,
    lr: float = 0.1,
    n_iter: int = 2000,
    l2: float = 0.0,
):
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)

    n, d = X.shape
    w = np.zeros(d)

    losses = np.empty(n_iter)

    for i in range(n_iter):
        p = sigmoid(X @ w)

        # average log-loss + L2 (skip intercept)
        losses[i] = binary_log_loss(y, p) + 0.5 * l2 * float(np.dot(w[1:], w[1:]))

        grad = (X.T @ (p - y)) / n
        grad[1:] += l2 * w[1:]

        w -= lr * grad

    return w, losses


def safe_div(num, denom):
    num = np.asarray(num, dtype=float)
    denom = np.asarray(denom, dtype=float)

    # Divide only where the denominator is nonzero to avoid 0/0 warnings.
    out = np.zeros_like(num)
    np.divide(num, denom, out=out, where=denom != 0)
    return out


def binary_metrics_from_counts(tp, tn, fp, fn):
    tp = np.asarray(tp, dtype=float)
    tn = np.asarray(tn, dtype=float)
    fp = np.asarray(fp, dtype=float)
    fn = np.asarray(fn, dtype=float)

    acc = safe_div(tp + tn, tp + tn + fp + fn)

    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    f1 = safe_div(2 * precision * recall, precision + recall)

    tpr = recall
    tnr = safe_div(tn, tn + fp)
    bal_acc = 0.5 * (tpr + tnr)

    mcc = mcc_from_counts(tp, tn, fp, fn)

    return {
        "mcc": mcc,
        "accuracy": acc,
        "balanced_accuracy": bal_acc,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
# Synthetic, imbalanced dataset
X, y = make_classification(
    n_samples=4000,
    n_features=6,
    n_informative=4,
    n_redundant=0,
    n_clusters_per_class=2,
    weights=[0.90, 0.10],
    class_sep=1.2,
    flip_y=0.02,
    random_state=7,
)

X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, stratify=y, random_state=7)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=7)

mu, sigma = standardize_fit(X_train)
X_train_s = standardize_apply(X_train, mu, sigma)
X_val_s = standardize_apply(X_val, mu, sigma)
X_test_s = standardize_apply(X_test, mu, sigma)

X_train_i = add_intercept(X_train_s)
X_val_i = add_intercept(X_val_s)
X_test_i = add_intercept(X_test_s)

w, losses = fit_logistic_regression_gd(X_train_i, y_train, lr=0.15, n_iter=2500, l2=1e-2)

fig = go.Figure()
fig.add_trace(go.Scatter(x=np.arange(losses.size), y=losses, name="train loss"))
fig.update_layout(title="From-scratch logistic regression (GD): training loss", xaxis_title="iteration", yaxis_title="loss")
fig.show()
# Threshold sweep on validation set
p_val = sigmoid(X_val_i @ w)

thresholds = np.linspace(0.0, 1.0, 401)

y_val_bool = y_val.astype(bool)
pred_pos = p_val[:, None] >= thresholds[None, :]

tp = np.sum(pred_pos & y_val_bool[:, None], axis=0)
fp = np.sum(pred_pos & ~y_val_bool[:, None], axis=0)
fn = np.sum(~pred_pos & y_val_bool[:, None], axis=0)
tn = np.sum(~pred_pos & ~y_val_bool[:, None], axis=0)

metrics = binary_metrics_from_counts(tp, tn, fp, fn)

best_idx = int(np.argmax(metrics["mcc"]))
best_t = float(thresholds[best_idx])

best_acc_idx = int(np.argmax(metrics["accuracy"]))
best_acc_t = float(thresholds[best_acc_idx])

best_t, best_acc_t
(0.305, 0.45)
# Plot how metrics change with the decision threshold

fig = go.Figure()

for name, values in metrics.items():
    fig.add_trace(go.Scatter(x=thresholds, y=values, name=name))

fig.add_vline(x=0.5, line_dash="dot", line_color="gray", annotation_text="t=0.5")
fig.add_vline(
    x=best_acc_t,
    line_dash="dash",
    line_color="gray",
    annotation_text=f"best accuracy t={best_acc_t:.3f}",
)
fig.add_vline(
    x=best_t,
    line_dash="dash",
    line_color="black",
    annotation_text=f"best MCC t={best_t:.3f}",
)

fig.update_layout(
    title="Validation curves vs threshold",
    xaxis_title="threshold",
    yaxis_title="metric value",
    yaxis=dict(range=[-0.05, 1.05]),
)
fig.show()
# Evaluate on the test set and compare thresholds
p_test = sigmoid(X_test_i @ w)


def metrics_at_threshold(t: float):
    y_pred = (p_test >= t).astype(int)
    tp, tn, fp, fn = confusion_counts_binary(y_test, y_pred, positive_label=1)
    m = binary_metrics_from_counts(tp, tn, fp, fn)
    return {k: float(v) for k, v in m.items()}, y_pred


for t in [0.5, best_acc_t, best_t]:
    m, _ = metrics_at_threshold(t)
    print(
        f"t={t:.3f} | MCC={m['mcc']:.3f} | acc={m['accuracy']:.3f} | bal_acc={m['balanced_accuracy']:.3f} | F1={m['f1']:.3f}"
    )

# Confusion matrix for the MCC-optimal threshold
m_best, y_test_pred = metrics_at_threshold(best_t)

cm_test, _ = confusion_matrix_np(y_test, y_test_pred, labels=np.array([0, 1]))

fig = px.imshow(
    cm_test,
    x=["pred 0", "pred 1"],
    y=["true 0", "true 1"],
    text_auto=True,
    color_continuous_scale="Blues",
)
fig.update_layout(title=f"Test confusion matrix (threshold={best_t:.3f}, MCC={m_best['mcc']:.3f})")
fig.show()

print("Test MCC (scratch):", m_best["mcc"])
print("Test MCC (sklearn):", sk_matthews_corrcoef(y_test, y_test_pred))
t=0.500 | MCC=0.578 | acc=0.931 | bal_acc=0.706 | F1=0.567
t=0.450 | MCC=0.601 | acc=0.934 | bal_acc=0.733 | F1=0.607
t=0.305 | MCC=0.567 | acc=0.921 | bal_acc=0.767 | F1=0.609
Test MCC (scratch): 0.5667784612933667
Test MCC (sklearn): 0.5667784612933667

7) Pros, cons, and when to use MCC#

Pros#

  • Uses all four confusion-matrix entries TP/TN/FP/FN (precision and recall, by contrast, both ignore TN)

  • Much more informative under class imbalance than accuracy

  • Symmetric: swapping positive/negative labels does not change the value

  • Single interpretable scale (\([-1,1]\)) with a correlation meaning

  • Works for multiclass via the confusion-matrix generalization

Cons / caveats#

  • Can be undefined when predictions (or labels) are constant; commonly returned as 0 by convention

  • Non-differentiable w.r.t. model parameters → not a direct gradient-descent loss

  • Threshold-dependent for probabilistic models; you often need threshold tuning

  • Can be noisy/unstable with very small sample sizes or extremely rare classes
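To see the last caveat concretely, here is a small simulation (the setup is invented for illustration): the same "flip 20% of labels" predictor is scored repeatedly at two sample sizes, and the spread of MCC estimates shrinks substantially as \(n\) grows:

```python
import numpy as np
from sklearn.metrics import matthews_corrcoef

rng = np.random.default_rng(0)

def mcc_spread(n, n_trials=200):
    """Std of MCC over repeated draws of an 80%-correct predictor."""
    vals = []
    for _ in range(n_trials):
        y = rng.integers(0, 2, size=n)
        y_hat = np.where(rng.random(n) < 0.2, 1 - y, y)
        vals.append(matthews_corrcoef(y, y_hat))
    return float(np.std(vals))

print("std of MCC at n=30:  ", mcc_spread(30))
print("std of MCC at n=3000:", mcc_spread(3000))
```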

When MCC shines#

  • Imbalanced binary classification where both error types matter (FP and FN)

  • Model selection and threshold tuning when you want a single score that “respects” the full confusion matrix

  • Domains with strong imbalance and asymmetric costs where accuracy is misleading (bioinformatics, fraud detection, anomaly detection)

8) Exercises + references#

Exercises#

  1. Compute MCC by hand for a few confusion matrices and interpret the sign.

  2. Implement a multiclass demo: generate \(K=3\) labels, perturb predictions, and verify your MCC matches scikit-learn.

  3. On the logistic regression demo above:

    • compare the threshold that maximizes accuracy vs MCC

    • try a more imbalanced dataset (e.g. 99/1) and re-run the threshold sweep

  4. Implement cross-validated model selection where the chosen hyperparameter maximizes validation MCC.

References#

  • Matthews, B. W. (1975). Comparison of the predicted and observed secondary structure of T4 phage lysozyme.

  • scikit-learn docs: sklearn.metrics.matthews_corrcoef

  • The phi coefficient (binary correlation) and its relationship to MCC